
Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering


Abstract

Problems at the intersection of vision and language are of significant importance both as challenging research questions and for the rich set of applications they enable. However, inherent structure in our world and bias in our language tend to be a simpler signal for learning than visual modalities, resulting in models that ignore visual information, leading to an inflated sense of their capability.

We propose to counter these language priors for the task of Visual Question Answering (VQA) and make vision (the V in VQA) matter! Specifically, we balance the popular VQA dataset by collecting complementary images such that every question in our balanced dataset is associated with not just a single image, but rather a pair of similar images that result in two different answers to the question. Our dataset is by construction more balanced than the original VQA dataset and has approximately twice the number of image-question pairs. Our complete balanced dataset is publicly available at www.visualqa.org as part of the 2nd iteration of the Visual Question Answering Dataset and Challenge (VQA v2.0).

We further benchmark a number of state-of-art VQA models on our balanced dataset. All models perform significantly worse on our balanced dataset, suggesting that these models have indeed learned to exploit language priors. This finding provides the first concrete empirical evidence for what seems to be a qualitative sense among practitioners.

Finally, our data collection protocol for identifying complementary images enables us to develop a novel interpretable model, which in addition to providing an answer to the given (image, question) pair, also provides a counter-example based explanation. Specifically, it identifies an image that is similar to the original image, but it believes has a different answer to the same question. This can help in building trust for machines among their users.
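The balancing idea described above can be sketched as a simple data structure: each question links a pair of similar images whose answers to that same question differ. The following is a minimal illustration only, assuming hypothetical field names and an invented example question; it is not the official VQA v2.0 annotation format.

```python
# Illustrative sketch of a balanced-pair entry (hypothetical schema,
# not the official VQA v2.0 annotation format).
from dataclasses import dataclass


@dataclass
class BalancedPair:
    question: str
    image_id: int        # original image
    answer: str          # answer for the original image
    comp_image_id: int   # complementary (similar) image
    comp_answer: str     # a different answer to the same question


# Invented example for illustration.
pair = BalancedPair(
    question="Is the umbrella upside down?",
    image_id=1,
    answer="yes",
    comp_image_id=2,
    comp_answer="no",
)

# By construction the two answers disagree, so a model cannot succeed
# on both images by exploiting the question text alone.
assert pair.answer != pair.comp_answer
```

Because every question appears with both answers across the pair, a language-prior-only model that always predicts the most frequent answer for a question type can be correct on at most one image of each pair.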
